Statisticians, economists, and system engineers are becoming aware
that to identify models for time series and dynamic systems, information
theoretic ideas can play a valuable (and unifying) role. This paper discusses
how models for a univariate or multivariate time series Y(t) can be formulated
as hypotheses about the information divergence between alternative models
for the conditional probability density of Y(t) given various bases
involving past, current, and future values of Y(·) and related time series
X(·). To determine sets of variables that are sufficient to forecast Y(t),
and thus to determine a model for Y(t), an approach is presented which
estimates and compares various information increments. These information
numbers play a central role in studies of causality and feedback. Approximating
autoregressive schemes are used to form estimators of the many information
numbers that one might compare to identify models for a time series.


Research  supported  by  Office  of  Naval  Research  under  contract  no. 
N00014-82-MP-20001 




0.  Introduction 


In  applications  of  statistical  theory,  it  is  important  to  distinguish 
between  the  problem  of  parameter  estimation  (which  belongs  to  confirmatory 
statistical  theory)  and  the  problem  of  model  identification  (which  belongs 
to exploratory statistical theory). The modeling problem arises in
conventional (static) statistics whenever the researcher's goal is to screen
variables, that is, to determine which of the variables for which measurements
exist are most associated with the specified variables which we seek to
explain, forecast, or control. Researchers are becoming aware [see IFAC
(1982)]  that  to  identify  models  for  time  series  and  dynamic  systems, 
information  theoretic  ideas  can  play  a  valuable  (and  unifying)  role  [see 
Akaike (1977)]. The thrust has been clearly articulated, but how to carry
it  out  has  not  been  clear.  That  entropy  ideas  have  a  role  in  spectral 
estimation  is  being  widely  stated;  however,  in  my  view  the  nature  of  the  role 
is  not  well  understood  by  most  users  of  spectral  estimation  techniques. 

This  paper  does  not  discuss  entropy-based  spectral  estimation  [see  Parzen 
(1982)];  it  is  concerned  with  identifying  time  domain  models  for  univariate 
and  multivariate  time  series  by  estimating  suitable  information  measures. 

Most  of  the  calculations  proposed  are  in  the  time  domain.  But  spectral 
density  concepts  and  calculations  are  also  used. 

Section  1  states  the  definition  of  various  information  measures  for 
probability  densities  and  for  random  variables.  The  conjectured  ease  of 
calculating  significance  levels  for  tests  of  hypotheses  by  estimating 
information increments is illustrated for the problem of testing independence
of  normal  random  variables  using  sample  correlation  coefficients. 


The  formulation  of  tests  for  white  noise  and  ARMA  models  in  terms  of 
information  measures  is  discussed  in  sections  2  and  3.  Multiple  time  series 
identification  is  discussed  in  section  4,  and  illustrated  by  an  example  in 
section 5.

Analysis  of  empirical  time  series  using  the  information  measures 
discussed  in  this  paper  has  been  implemented  in  our  computer  subroutine 
library  TIMESBOARD  of  time  series  analysis  programs  which  is  the  creation 
of  Professor  H.  J.  Newton.  I  would  like  to  express  my  appreciation  to 
Dr. Newton for his close collaboration in this research program. The
work  of  Parzen  and  Newton  (1980)  provides  a  foundation  for  section  4  of 
this  paper. 


1.  Role of Information Measures in Model Identification

The concept of information theory most familiar to statisticians is the
entropy, denoted H(f), of a continuous distribution with probability density
f(x), $-\infty < x < \infty$, defined by [log is taken with base e]

    $$H(f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, f(x)\, dx.$$

A more general concept is information divergence I(f;g) of a density
g(x), usually representing a model, from a density f(x), usually
representing the true density. We define

    $$I(f;g) = \int_{-\infty}^{\infty} \left\{-\log \frac{g(x)}{f(x)}\right\} f(x)\, dx.$$

To express information divergence in terms of entropy, define the
cross-entropy H(f;g) of f(·) and g(·) by

    $$H(f;g) = \int_{-\infty}^{\infty} \{-\log g(x)\}\, f(x)\, dx.$$

Information divergence has the important decomposition

    (*)    $$0 \le I(f;g) = H(f;g) - H(f).$$


There is an important relation between entropy and measures of deviation
(scale parameter) denoted $\sigma$. A location-scale parameter model for a
density f(x) is

    $$f(x) = \frac{1}{\sigma}\, f_0\!\left(\frac{x-\mu}{\sigma}\right)$$

where $f_0(\cdot)$ is a known density, and $\mu$ and $\sigma$ are parameters
to be estimated. One may verify that

    $$H(f) = \log \sigma + H(f_0).$$

For a normal distribution, the standard density $f_0(x)$ is usually defined by

    $$f_0(x) = \phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2} x^2\right);$$

then $H(f) = \log \sigma + \tfrac{1}{2}\{1 + \log 2\pi\}$. A new standardization
of the normal distribution proposed by Stigler (1982) is the density

    $$f_0(x) = e^{-\pi x^2}.$$

Then $H(f_0) = 0.5$, and $H(f) = \log \sigma + 0.5$.
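
The entropy relations for the location-scale model can be checked numerically. The following is a minimal sketch, not part of the original paper: it approximates H(f) by a midpoint rule for the two standardizations of the normal density and compares the result with $\log \sigma + H(f_0)$. The function names, the integration range, and the values of $\mu$ and $\sigma$ are illustrative assumptions.

```python
import math

def entropy_numeric(density, lo, hi, steps=20_000):
    """Midpoint-rule approximation of H(f) = integral of {-log f(x)} f(x) dx."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        fx = density(lo + (i + 0.5) * h)
        if fx > 0.0:                      # skip points where f underflows to zero
            total -= math.log(fx) * fx * h
    return total

def normal_density(mu, sigma):
    # f(x) = (1/sigma) f0((x - mu)/sigma) with f0 the usual standard normal density
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def stigler_density(mu, sigma):
    # Same location-scale model with the standardization f0(x) = exp(-pi x^2)
    return lambda x: math.exp(-math.pi * ((x - mu) / sigma) ** 2) / sigma

mu, sigma = 1.0, 2.0                      # illustrative parameter values
h_normal = entropy_numeric(normal_density(mu, sigma), -40.0, 40.0)
h_stigler = entropy_numeric(stigler_density(mu, sigma), -40.0, 40.0)

print(h_normal, math.log(sigma) + 0.5 * (1 + math.log(2 * math.pi)))
print(h_stigler, math.log(sigma) + 0.5)
```

In both cases the location parameter does not enter the entropy, and the scale parameter enters only through $\log \sigma$.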


One of the aims of this paper is to point out that many familiar
statistics for testing hypotheses about the models fitting data can be
formulated as entropy-difference statistics. Thus an F-test forms

    $$F = \hat\sigma_1^2 \div \hat\sigma_2^2$$

where $\hat\sigma_j^2$ is an estimator of a variance of a normal distribution.
Instead of F, consider Fisher's original proposal to form

    $$Z = \tfrac{1}{2} \log F = \log \hat\sigma_1 - \log \hat\sigma_2.$$

We can write $Z = \hat H_1 - \hat H_2$, where $\hat H_j$ is an estimator of
entropy based on $\hat\sigma_j$. In words, Z is a difference of two different
estimators of entropy. Our aim in this paper is to systematically develop
statistics for testing model identification hypotheses which can be
interpreted as entropy-difference statistics. The entropy-difference
statistics that arise in time series can be further interpreted as measuring
information. We outline various facts which justify a conjecture that
information-based test statistics have similar distributions.

We next define information measures for random variables and time
series. For a continuous random variable Y with probability density $f_Y(y)$,
the entropy of Y is defined by

    $$H(Y) = H(f_Y).$$

For a continuous random variable Y and continuous random vector X the
conditional entropy of Y given X is defined

    $$H(Y|X) = H(f_{Y|X}) = E_X\, H(f_{Y|X=x}).$$

Explicitly, when X is a random variable,

    $$E_X\, H(f_{Y|X=x}) = \int_{-\infty}^{\infty} H(f_{Y|X=x})\, f_X(x)\, dx$$

where

    $$H(f_{Y|X=x}) = \int_{-\infty}^{\infty} \{-\log f_{Y|X=x}(y)\}\, f_{Y|X=x}(y)\, dy.$$

The information I(Y|X) about a continuous random variable Y in a
continuous random variable X is defined by

    $$I(Y|X) = I(f_{Y|X}; f_Y) = E_X\, I(f_{Y|X=x}; f_Y)
             = \int_{-\infty}^{\infty} I(f_{Y|X=x}; f_Y)\, f_X(x)\, dx.$$

A fundamental fact is that

    (**)    $$I(Y|X) = H(Y) - H(Y|X).$$

To see this, note that by (*)

    $$I(f_{Y|X=x}; f_Y) = H(f_{Y|X=x}; f_Y) - H(f_{Y|X=x}).$$

Take expectation with respect to X and verify that

    $$\int_{-\infty}^{\infty} H(f_{Y|X=x}; f_Y)\, f_X(x)\, dx
      = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \{-\log f_Y(y)\}\, f_{X,Y}(x,y)\, dx\, dy = H(Y).$$

The most fundamental concept used in identifying models by estimating
information is $I(Y|X_1; X_1, X_2)$, the information about Y in $X_2$
conditional on $X_1$; it is defined, by analogy with equation (**),

    (***)    $$I(Y|X_1; X_1, X_2) = E_{X_1,X_2}\, I(f_{Y|X_1,X_2}; f_{Y|X_1})
                                   = H(Y|X_1) - H(Y|X_1, X_2).$$

A fundamental formula to evaluate $I(Y|X_1; X_1, X_2)$ is

    (****)    $$I(Y|X_1; X_1, X_2) = I(Y|X_1, X_2) - I(Y|X_1).$$

When X and Y are jointly normal random variables, $f_{Y|X=x}(y)$ is a normal
distribution whose variance (which does not depend on x) is denoted
$\Sigma(Y|X)$. The variance of Y is denoted $\Sigma(Y)$. The entropy and
conditional entropy of Y are

    $$H(Y) = \tfrac{1}{2} \log \Sigma(Y) + \tfrac{1}{2}(1 + \log 2\pi)$$

    $$H(Y|X) = \tfrac{1}{2} \log \Sigma(Y|X) + \tfrac{1}{2}(1 + \log 2\pi)$$

The information about Y in X is written

    $$I(Y|X) = -\tfrac{1}{2} \log \Sigma^{-1}(Y)\, \Sigma(Y|X).$$

When Y and X are jointly multivariate normal random vectors, let $\Sigma$
denote a covariance matrix. One can show that

    $$I(Y|X) = (-\tfrac{1}{2}) \log \det \Sigma^{-1}(Y)\, \Sigma(Y|X)
             = (-\tfrac{1}{2})\ \text{sum}\ \log\ \text{eigenvalues}\ \Sigma^{-1}(Y)\, \Sigma(Y|X).$$
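
For jointly normal variables, the information numbers reduce to ratios of conditional variances. The following is a minimal sketch for a scalar Y, assuming a known joint covariance matrix. The helper names `solve` and `information` and the covariance values are hypothetical, chosen only for illustration. The sketch evaluates $I(Y|X_1)$, $I(Y|X_1, X_2)$, and the increment given by formula (****).

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def information(cov, y, xs):
    """I(Y|X) = -1/2 log( Sigma(Y|X) / Sigma(Y) ) for scalar Y and predictor indices xs."""
    if not xs:
        return 0.0
    sxx = [[cov[i][j] for j in xs] for i in xs]
    sxy = [cov[i][y] for i in xs]
    coef = solve(sxx, sxy)                              # regression coefficients of Y on X
    cond_var = cov[y][y] - sum(c * s for c, s in zip(coef, sxy))
    return -0.5 * math.log(cond_var / cov[y][y])

# Hypothetical covariance matrix, variables ordered (Y, X1, X2)
cov = [[1.0, 0.6, 0.3],
       [0.6, 1.0, 0.5],
       [0.3, 0.5, 1.0]]

i_y_x1 = information(cov, 0, [1])
i_y_x1x2 = information(cov, 0, [1, 2])
increment = i_y_x1x2 - i_y_x1                           # I(Y|X1; X1, X2) by (****)
print(i_y_x1, i_y_x1x2, increment)
```

In this illustrative matrix the correlation of Y with $X_2$ equals the product of the other two correlations, so the increment is zero: $X_1$ alone is sufficient to predict Y.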
To make the foregoing formulas concrete, and to describe the general
approach of this paper, consider the general problem of testing the
hypothesis $H_0$: X and Y are independent. One could express $H_0$ in any one
of the following equivalent ways:

    $$H_0:\ f_{Y|X=x}(y) = f_Y(y)\ \text{for all x and y};$$

    $$H_0:\ I(f_{X,Y}; f_X f_Y) = 0;$$

    $$H_0:\ I(Y|X) = 0.$$

The information approach to testing is to form an estimator $\hat I(Y|X)$
of I(Y|X), and test whether it is significantly different from zero. One can
distinguish several types of estimators of I(Y|X): (a) fully parametric;
(b) fully non-parametric; (c) functionally parametric, which uses functional
statistical inference smoothing techniques to estimate I(Y|X) [see Woodfield
(1983)]. In this paper we consider only fully parametric estimators based on
assuming multivariate normality of Y and X. When X and Y are bivariate
normal with correlation coefficient $\rho$,

    $$I(Y|X) = -\tfrac{1}{2} \log (1 - \rho^2).$$

Given a random sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ the maximum likelihood
estimator of I(Y|X) is

    $$\hat I(Y|X) = -\tfrac{1}{2} \log (1 - \hat\rho^2)$$

where $\hat\rho$ is the sample correlation coefficient. A test of $H_0$ based
on $\hat\rho$ would reject at the 5% level of significance if $|\hat\rho|$ is
greater than the threshold given in the following table:


    Sample size n    Threshold for $|\hat\rho|$    Threshold for $\hat I(Y|X)$

          20                  .444                        .11
          40                  .312                        .05
          50                  .279                        .04
          80                  .220                        .025
         100                  .197                        .02
         150                  .160                        .013
         200                  .139                        .01
           n                    ?                         2/n

In the foregoing table one sees a remarkable regularity in the 5% significance
levels for the estimated information; they are approximately given by the
simple formula 2/n. Test statistics based on entropy have 5% significance
levels obeying the approximate rule m/n where n is the sample size and m is a
constant which varies with the statistic used. At this time this perceived
regularity is mainly an empirical fact; its theoretical basis is the conjecture
that asymptotically $2n\, \hat I(Y|X)$ has a chi-squared distribution with a
suitable number m of degrees of freedom. If one transforms the 5% significance
levels of the multiple correlation coefficient to significance levels for
$\hat I = -\tfrac{1}{2} \log (1 - R^2)$, one discovers that the transformed
critical values approximately obey the formula (1 + k)/n, where n is the
sample size, and k is the number of regression variables. These empirical
facts support the recommendation that statisticians should in their thinking
replace $R^2$ by information I.
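
The regularity in the table can be reproduced directly. The sketch below, added for illustration, transforms each tabulated threshold for $|\hat\rho|$ into a threshold for $\hat I(Y|X) = -\tfrac{1}{2}\log(1-\hat\rho^2)$ and prints it next to 2/n. The last column assumes the chi-squared conjecture with one degree of freedom, using the standard 5% point 3.841. The function name is hypothetical.

```python
import math

def information_from_correlation(r):
    """Gaussian information -1/2 log(1 - r^2) for a correlation coefficient r."""
    return -0.5 * math.log(1.0 - r * r)

# (sample size n, 5% threshold for |rho-hat|) as listed in the table above
table = [(20, 0.444), (40, 0.312), (50, 0.279), (80, 0.220),
         (100, 0.197), (150, 0.160), (200, 0.139)]

CHI2_1_95 = 3.841   # upper 5% point of chi-squared with 1 degree of freedom

rows = [(n, r, information_from_correlation(r), 2.0 / n, CHI2_1_95 / (2.0 * n))
        for n, r in table]

for n, r, info, rule, chi2_rule in rows:
    print(f"n={n:4d}  |rho|={r:.3f}  I={info:.4f}  2/n={rule:.4f}  3.841/(2n)={chi2_rule:.4f}")
```

Under the conjecture that $2n\,\hat I$ is asymptotically chi-squared with one degree of freedom, the threshold would be about 1.92/n, which is close to the empirical rule 2/n.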


2.  Information Formulation of Tests for White Noise

Let {Y(t), t = 0, ±1, ...} be a zero mean stationary Gaussian time series.
The information about the value Y(t) at time t in the m most recent values
Y(t-1), ..., Y(t-m) is denoted

    $$I_m = I(Y(t) \mid Y(t-1), \ldots, Y(t-m)).$$

It is more convenient to write henceforth

    $$I_m = I(Y \mid Y_{-1}, \ldots, Y_{-m}).$$

Introduce now the following notation for predictors (conditional
expectations):

    $$Y^{\mu,m}(t) = E[Y(t) \mid Y(t-1), \ldots, Y(t-m)] = (Y \mid Y_{-1}, \ldots, Y_{-m})(t),$$

    $$Y^{\nu,m}(t) = Y(t) - Y^{\mu,m}(t),$$

    $$\sigma_m^2 = E[|Y^{\nu,m}(t)|^2] \div E[|Y(t)|^2]
                 = \Sigma(Y \mid Y_{-1}, \ldots, Y_{-m})\, \Sigma^{-1}(Y).$$

The information $I_m$ about Y in $Y_{-1}, \ldots, Y_{-m}$ satisfies

    $$I_m = -\tfrac{1}{2} \log \sigma_m^2.$$

Next, let $Y^-$ denote the infinite past Y(t-1), Y(t-2), ..., and let

    $$I_\infty = I(Y \mid Y^-).$$

One can show that

    $$I_\infty = -\tfrac{1}{2} \log \sigma_\infty^2 = (-\tfrac{1}{2}) \int_0^1 \log f(\omega)\, d\omega$$

where $f(\omega)$ is the spectral density function of the time series Y(t)
satisfying

    $$\rho(v) = E[Y(t)\, Y(t+v)] \div E[Y^2(t)]
              = \int_0^1 \exp(2\pi i v \omega)\, f(\omega)\, d\omega, \quad v = 0, \pm 1, \ldots$$

One of the powerful properties of information is that $I_\infty$ can be
evaluated as a limit of $I_m$:

    $$\lim_{m \to \infty} I_m = I_\infty.$$

The value of $I_\infty$ (in the Gaussian case, the value of $\sigma_\infty^2$)
is used to classify the memory type of the time series as defined by Parzen
(1981); a stationary (Gaussian) time series Y(·) is defined to be:

    no memory      if  $I_\infty = 0$            ($\sigma_\infty^2 = 1$);

    short memory   if  $0 < I_\infty < \infty$   ($0 < \sigma_\infty^2 < 1$);

    long memory    if  $I_\infty = \infty$       ($\sigma_\infty^2 = 0$).
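
The spectral formula for $I_\infty$ can be illustrated on a first-order autoregression, for which the answer is known in closed form. The following is a minimal sketch under stated assumptions: the series obeys Y(t) + a Y(t-1) = e(t) with |a| < 1, its normalized spectral density is $(1-a^2)/|1 + a e^{2\pi i \omega}|^2$, and the integral is approximated by a midpoint rule. All function names are hypothetical, and a numerical value can only suggest, not establish, the long memory case.

```python
import cmath
import math

def ar1_spectral_density(a, w):
    """Normalized spectral density of Y(t) + a Y(t-1) = e(t), |a| < 1, on 0 <= w <= 1."""
    return (1.0 - a * a) / abs(1.0 + a * cmath.exp(2j * math.pi * w)) ** 2

def information_infinite_past(density, steps=4096):
    """I_inf = -(1/2) * integral over [0, 1] of log f(w) dw, by the midpoint rule."""
    return -0.5 * sum(math.log(density((k + 0.5) / steps)) for k in range(steps)) / steps

def memory_type(i_inf, tol=1e-9):
    """Classify by the value of I_inf; 'long memory' needs an infinite value."""
    if i_inf <= tol:
        return "no memory"
    return "short memory" if math.isfinite(i_inf) else "long memory"

a = 0.5                                           # illustrative AR(1) coefficient
i_inf = information_infinite_past(lambda w: ar1_spectral_density(a, w))
closed_form = -0.5 * math.log(1.0 - a * a)        # since sigma_inf^2 = 1 - a^2 for an AR(1)
print(i_inf, closed_form, memory_type(i_inf))
```

For an AR(1) the limit is reached at m = 1, so $I_1 = I_\infty$; for other schemes $I_m$ increases with m toward $I_\infty$.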
To estimate $I_m$ for m = 1, 2, ..., and also $I_\infty$, from a sample Y(t),
t = 1, 2, ..., T, one uses the same estimators as if one were fitting an
autoregressive scheme of order m to the time series:

    $$Y(t) + \alpha_m(1)\, Y(t-1) + \ldots + \alpha_m(m)\, Y(t-m) = \epsilon(t)$$

where $\epsilon(t)$ is a white noise time series with variance denoted

    $$\sigma_m^2 = E|\epsilon(t)|^2 \div E|Y(t)|^2.$$

We do not explicitly write the formulas for the estimators $\hat\sigma_m^2$.

The hypothesis

    $$H_0:\ Y(t)\ \text{is white noise}$$

can be formulated in terms of information measures as

    $$H_0:\ I_m = 0\ \text{for}\ m = 1, 2, \ldots$$

For any fixed m, to test the hypothesis that $I_m = 0$ one forms a test
statistic of the form

    $$\hat I_m = -\tfrac{1}{2} \log \hat\sigma_m^2.$$


A 95% significance level for $\hat I_m$ seems to be approximately equivalent
to a threshold for $\hat I_m$ of the form

    $$\frac{m^*}{T}$$

where T is the time series sample size and m* is a suitable constant which
depends on the order m (of the predictor) and the sample size T. Two
widely used formulas for m* are [see Shibata (1981) for references]:

(1)  m* = m, Akaike criterion;

(2)  m* = m (log log T), Hannan-Quinn criterion.

The optimal value of m* for a given order m could be determined by
Monte Carlo simulation. However, we need a sequence of thresholds, one for
each order m = 1, 2, ..., so that the test region obtained by comparing each
$\hat I_m$ with its threshold provides an "optimum" test of the hypothesis
that the time series is white noise. In choosing the critical values one will
undoubtedly use random walk theory since one can represent

    $$\hat I_m = -\tfrac{1}{2} \log \hat\sigma_m^2
               = -\tfrac{1}{2} \sum_{j=1}^{m} \log \{1 - \hat\rho^2(j \mid 1, \ldots, j-1)\}$$

where $\rho(j \mid 1, \ldots, j-1)$ is the partial correlation coefficient of
Y(t) and Y(t-j) conditioned on Y(t-1), ..., Y(t-(j-1)). The sample partial
correlation coefficients $\hat\rho(j \mid 1, \ldots, j-1)$ are asymptotically
independent N(0, (1/n)) under the hypothesis that Y(·) is white noise. The
important work of Anderson (1971), p. 270, on the model order determination
problem should be related to the random walk approach.
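
The representation of the information statistic as a sum over partial correlations suggests computing it by the Levinson-Durbin recursion. The sketch below is an illustration, not the estimator used in the author's programs: it takes correlations $\rho(0), \ldots, \rho(M)$ as input (sample correlations in practice), returns the partial correlations and the information numbers for m = 1, ..., M, and evaluates the two thresholds m*/T quoted in the text. The function names are hypothetical.

```python
import math

def levinson_information(rho, max_order):
    """Partial correlations and I_m = -1/2 log sigma_m^2 from correlations rho[0..max_order]."""
    phi = []                  # coefficients of the current order m-1 predictor
    sigma2 = 1.0              # normalized prediction error variance sigma_m^2
    partials, info = [], []
    for m in range(1, max_order + 1):
        acc = rho[m] - sum(phi[j] * rho[m - 1 - j] for j in range(m - 1))
        k = acc / sigma2                                   # partial correlation at lag m
        phi = [phi[j] - k * phi[m - 2 - j] for j in range(m - 1)] + [k]
        sigma2 *= 1.0 - k * k                              # sigma_m^2 = sigma_{m-1}^2 (1 - k^2)
        partials.append(k)
        info.append(-0.5 * math.log(sigma2))
    return partials, info

def thresholds(m, T):
    """Thresholds m*/T for the two choices of m* quoted in the text (requires T > e)."""
    return {"akaike": m / T, "hannan_quinn": m * math.log(math.log(T)) / T}

# Illustrative input: exact correlations of an AR(1) with rho(v) = 0.5 ** v
rho = [0.5 ** v for v in range(6)]
partials, info = levinson_information(rho, 5)
summed = [-0.5 * sum(math.log(1.0 - p * p) for p in partials[:m]) for m in range(1, 6)]
print(partials)
print(info, summed)
print(thresholds(2, 296))
```

For the AR(1) input only the first partial correlation is nonzero, so the information numbers are constant in m, and the direct and summed forms agree.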


3.  Information Formulation of ARMA Models

A white noise time series is characterized by the fact that the past
has no information about the present. An autoregressive of order p, or
AR(p), time series can be defined as one for which the most recent p values
have as much information as the infinite past. In symbols, the following two
hypotheses are equivalent:

    $$H_0:\ Y(\cdot)\ \text{is AR(p)},$$

    $$H_0:\ I(Y \mid Y_{-1}, \ldots, Y_{-p};\ Y^-) = I_\infty - I_p = 0.$$

An ARMA(p,q) scheme is usually defined by the representation

    $$Y(t) + \alpha_p(1)\, Y(t-1) + \ldots + \alpha_p(p)\, Y(t-p)
      = \epsilon(t) + \beta_q(1)\, \epsilon(t-1) + \ldots + \beta_q(q)\, \epsilon(t-q)$$

where the polynomials

    $$g_p(z) = 1 + \alpha_p(1)\, z + \ldots + \alpha_p(p)\, z^p$$

    $$h_q(z) = 1 + \beta_q(1)\, z + \ldots + \beta_q(q)\, z^q$$

are chosen so that all their roots in the complex z-plane are in the region
{z: |z| > 1} outside the unit circle.

To give an information characterization define the innovation time series

    $$Y^\nu(t) = Y(t) - Y^\mu(t) = \lim_{m \to \infty} Y^{\nu,m}(t),$$

    $$Y^\mu(t) = E[Y(t) \mid Y(t-1), Y(t-2), \ldots] = (Y \mid Y^-)(t).$$

The following hypotheses can be shown to be equivalent:

    $$H_0:\ Y(\cdot)\ \text{is ARMA(p,q)},$$

    $$I(Y \mid Y_{-1}, \ldots, Y_{-p},\ Y^\nu_{-1}, \ldots, Y^\nu_{-q};\ Y^-) = 0,$$

    $$(Y \mid Y_{-1}, \ldots, Y_{-p},\ Y^\nu_{-1}, \ldots, Y^\nu_{-q})(t) = (Y \mid Y^-)(t).$$

To compute the information one needs to compute the conditional variance
$\Sigma(Y \mid Y_{-1}, \ldots, Y_{-p},\ Y^\nu_{-1}, \ldots, Y^\nu_{-q})$. To do
this in practice we propose the following procedure:

1.  Fit an AR(p) of order p determined by an order determination
criterion.

2.  Invert the AR(p) to form its MA($\infty$), infinite moving average
representation,

    $$Y(t) = Y^\nu(t) + \beta_1\, Y^\nu(t-1) + \beta_2\, Y^\nu(t-2) + \ldots$$

which is a non-parametric estimator of the MA($\infty$) representation. Note
that

    $$\frac{1}{\sigma_\infty^2} = \{1 + \beta_1^2 + \beta_2^2 + \ldots\}$$

and that the correlations $\rho(v)$ = Corr[Y(t), Y(t+v)] are estimated by

    $$\rho(v) = \sigma_\infty^2\, \{\beta_v + \beta_1 \beta_{v+1} + \beta_2 \beta_{v+2} + \ldots\}.$$

3.  Form the joint covariance matrix of Y(t), Y(t-1), ..., Y(t-p),
$Y^\nu(t-1), \ldots, Y^\nu(t-q)$ for suitable values of p and q. By using
matrix sweep operators one can form the desired conditional variance

    $$\sigma_{p,q}^2 = \Sigma^{-1}(Y)\, \Sigma(Y \mid Y_{-1}, \ldots, Y_{-p},\ Y^\nu_{-1}, \ldots, Y^\nu_{-q}).$$

Note that

    $$I(Y \mid Y_{-1}, \ldots, Y_{-p},\ Y^\nu_{-1}, \ldots, Y^\nu_{-q};\ Y^-) = I_\infty - I_{p,q},
      \qquad I_{p,q} = -\tfrac{1}{2} \log \sigma_{p,q}^2.$$

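We illustrate this procedure by stating the conclusion for an ARMA(1,1):

    $$I(Y \mid Y_{-1}, Y^\nu_{-1};\ Y^-) = \tfrac{1}{2} \log \frac{1}{\sigma_\infty^2}
      \left\{ 1 - \rho^2(1) - \frac{\sigma_\infty^2\, \{\beta_1 - \rho(1)\}^2}{1 - \sigma_\infty^2} \right\}.$$

One can verify that this information number equals 0 if the time series obeys
any one of the schemes AR(1), MA(1), or ARMA(1,1). The information numbers
for an AR(1) and MA(1) are respectively

    $$I(Y \mid Y_{-1};\ Y^-) = \tfrac{1}{2} \log \frac{1 - \rho^2(1)}{\sigma_\infty^2},$$

    $$I(Y \mid Y^\nu_{-1};\ Y^-) = \tfrac{1}{2} \log \frac{1 - \sigma_\infty^2 \beta_1^2}{\sigma_\infty^2}.$$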
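
The ARMA(1,1) information number depends only on $\sigma_\infty^2$, $\rho(1)$, and $\beta_1$, all of which follow from the MA($\infty$) weights. The following is a minimal sketch, assuming a truncated list of weights $\beta_1, \beta_2, \ldots$ is available from step 2 of the procedure. The conditional variance is obtained by regressing Y(t) on Y(t-1) and the lagged innovation with Var Y = 1. The function names and the example coefficients are hypothetical.

```python
import math

def ma_summary(beta):
    """From truncated MA(infinity) weights [beta_1, beta_2, ...] return
    (sigma_inf^2, rho(1), beta_1); beta_0 = 1 is implied."""
    b = [1.0] + list(beta)
    sigma2_inf = 1.0 / sum(x * x for x in b)
    rho1 = sigma2_inf * sum(b[k] * b[k + 1] for k in range(len(b) - 1))
    return sigma2_inf, rho1, b[1]

def arma11_information(sigma2_inf, rho1, beta1):
    """I(Y | Y_{-1}, innovation_{-1}; Y^-); undefined for white noise (sigma_inf^2 = 1)."""
    s11 = 1.0 - rho1 ** 2 - sigma2_inf * (beta1 - rho1) ** 2 / (1.0 - sigma2_inf)
    return 0.5 * math.log(s11 / sigma2_inf)

# Illustrative schemes, written as Y(t) = phi Y(t-1) + e(t) + theta e(t-1)
ar1 = [0.6 ** k for k in range(1, 201)]                      # phi = 0.6, theta = 0
ma1 = [0.5]                                                  # phi = 0,   theta = 0.5
arma11 = [(0.6 + 0.3) * 0.6 ** (k - 1) for k in range(1, 201)]   # phi = 0.6, theta = 0.3

for name, beta in [("AR(1)", ar1), ("MA(1)", ma1), ("ARMA(1,1)", arma11)]:
    print(name, arma11_information(*ma_summary(beta)))
```

Each of the three schemes gives an information number that is zero up to rounding, while a scheme outside the ARMA(1,1) family, such as an AR(2), gives a strictly positive value.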


We do not discuss rigorously the method by which one chooses the best
fitting ARMA(p,q). The method introduced by Akaike can be regarded as
computing for each p,q an estimator $\hat I_{p,q}$ of information from which
one subtracts its significance level (a multiple of the expected value of
$\hat I_{p,q}$) under the hypothesis of white noise. Analogues of subset
regression methods also seem to work in practice, and are used in our time
series programs ARSPID and TIMESBOARD.


4.  Multiple  Time  Series  Model  Identification 

Let Y = {Y(t), t = 0, ±1, ...} be a multiple zero mean Gaussian stationary
time series. One seeks to model Y(t) in terms of its own past values, and
values of multiple time series X = {X(t), t = 0, ±1, ...}. A model begins
with a representation

    $$Y(t) = Y^\mu(t) + Y^\nu(t)$$

where $Y^\mu(t)$ is the linear predictor of Y(t) given specified variables in
the set {Y(t-1), Y(t-2), ...; X(s), s = 0, ±1, ...}.

One always defines $Y^\nu(t) = Y(t) - Y^\mu(t)$. The probability law of the
zero mean Gaussian multiple time series {$Y^\nu(t)$, t = 0, ±1, ...} is
described by the sequence of prediction error covariance matrices

    $$\Sigma_\nu(v) = E[Y^\nu(t)\, \{Y^\nu(t+v)\}^*]$$

where * denotes the complex conjugate of a matrix. The zero lag covariance
$\Sigma_\nu(0)$ is used in the evaluation of information. This matrix is
written

    $$\Sigma(Y \mid \text{predictor variables})$$

to indicate clearly which variables are used. We now describe various
important information numbers and how they are computed (sample analogues of
the following formulas are used for estimation). The information numbers
we form are of the form I(Y|X) or $I(Y|X_1; X_1, X_2)$ where X, $X_1$, $X_2$
are sets of predictor variables. I(Y|X) = 0 means that there is no significant
dependence of Y on the variables in X; I(Y|X) > 0 means that one can predict Y
from the variables in X. $I(Y|X_1; X_1, X_2) = 0$ means that there is no
information about Y in $X_2$ in addition to the information about Y in $X_1$.
For each information number we list two hypotheses $H_0$ and $H_1$ which the
information number can be used as a test statistic to distinguish. We
write: $X^-$ to denote past X (the set X(t-1), X(t-2), ...); $X^+$ to denote
past and present X (the set X(t), X(t-1), ...); X to denote all (past,
present, and future) X (the set X(s), s = 0, ±1, ...).

To decide which explanatory variables to use in modeling Y one computes
estimators of the information numbers $I(Y|Y^-)$, $I(Y|X^-,Y^-)$,
$I(Y|X^+,Y^-)$, $I(Y|X,Y^-)$, $I(Y|X)$ which one compares with their
respective expected values to determine which information number most exceeds
its expected or threshold values.

(1)  $I(Y|Y^-)$, the information about Y in the infinite past of Y, is
determined by computing (using Yule Walker equations) for p = 1, 2, ...

    $$I(Y \mid Y_{-1}, \ldots, Y_{-p}) = (-\tfrac{1}{2}) \log \det \Sigma^{-1}(Y)\, \Sigma(Y \mid Y_{-1}, \ldots, Y_{-p})$$

and determining an order p such that the value of the information about Y(t)
in the p past values Y(t-1), ..., Y(t-p) is used as an estimator of the
information about Y(t) in Y(t-1), Y(t-2), ... . This estimator satisfies the
general formula

    $$\log \det \Sigma(Y \mid Y^-) = \int_0^1 \log \det f_Y(\omega)\, d\omega$$

if the spectral density matrix of Y(·) is estimated by the autoregressive
spectral density estimator of order p.

For use in (5), we also compute at this stage $I(X|X^-)$.

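(2)  $I(Y|X^-,Y^-)$, the information about Y in the infinite past of X and Y,
is determined by fitting multiple autoregressive schemes of order p = 1, 2, ...
to the joint time series

    $$\begin{bmatrix} X(t) \\ Y(t) \end{bmatrix}$$

to estimate the mean square prediction error matrices
$\Sigma(X,Y \mid X^-,Y^-)$ which are used (for a suitable order p). It is
represented as a partitioned matrix

    $$\Sigma(X,Y \mid X^-,Y^-) =
      \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix}$$

where $\Sigma_{XX} = \Sigma(X \mid X^-,Y^-)$, $\Sigma_{YY} = \Sigma(Y \mid X^-,Y^-)$,
and $\Sigma_{XY}$ is the conditional covariance matrix of X and Y, given $X^-$
and $Y^-$. Then

    $$I(Y \mid X^-,Y^-) = (-\tfrac{1}{2}) \log \det \Sigma^{-1}(Y)\, \Sigma(Y \mid X^-,Y^-).$$

We also compute at this stage $I(X \mid X^-,Y^-)$ which is used in (5).
The approximating autoregressive scheme is also used to estimate
the spectral density matrix

    $$f(\omega) = \begin{bmatrix} f_{XX}(\omega) & f_{XY}(\omega) \\ f_{YX}(\omega) & f_{YY}(\omega) \end{bmatrix}$$

which is used in (3), and coherency
$C(\omega) = f_{YY}^{-1}(\omega)\, f_{YX}(\omega)\, f_{XX}^{-1}(\omega)\, f_{XY}(\omega)$.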


Several important identities can now be stated. The determinant of a
partitioned matrix can be evaluated

    $$\log \det \Sigma(X,Y \mid X^-,Y^-) = \log \det \Sigma_{XX}
      + \log \det \{\Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}\}.$$

However $\Sigma_{XX} = \Sigma(X \mid X^-,Y^-)$, and Parzen (1969), p. 402 shows
that

    $$\Sigma(Y \mid X^+,Y^-) = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}.$$

Thus we have the identity:

    (I)    $$\log \det \Sigma(X,Y \mid X^-,Y^-) = \log \det \Sigma(X \mid X^-,Y^-) + \log \det \Sigma(Y \mid X^+,Y^-).$$

Similarly,

    $$\log \det f(\omega) = \log \det f_{XX}(\omega)
      + \log \det \{f_{YY}(\omega) - f_{YX}(\omega)\, f_{XX}^{-1}(\omega)\, f_{XY}(\omega)\}.$$

Integrating with respect to $\omega$ over $0 \le \omega \le 1$, we obtain the
identity

    (II)    $$\log \det \Sigma(X,Y \mid X^-,Y^-) = \log \det \Sigma(X \mid X^-) + \log \det \Sigma(Y \mid X,Y^-)$$

since the spectral density of the error time series
$(Y|X)^\nu(t) = Y(t) - (Y|X)(t)$ is

    $$f_{(Y|X)^\nu}(\omega) = f_{YY}(\omega) - f_{YX}(\omega)\, f_{XX}^{-1}(\omega)\, f_{XY}(\omega).$$

Identities (I) and (II) play an important role below in stage (5); their
importance may have been first pointed out by Geweke (1982), Theorem 1.
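(3)  $I(Y|X)$, the information about Y in all of X, is computed by

    $$I(Y|X) = (-\tfrac{1}{2}) \log \det \Sigma^{-1}(Y)\, \Sigma(Y|X)$$

where

    $$\Sigma(Y|X) = \int_0^1 f_{YY}(\omega)\, \{I - C(\omega)\}\, d\omega
                  = \int_0^1 \{f_{YY}(\omega) - f_{YX}(\omega)\, f_{XX}^{-1}(\omega)\, f_{XY}(\omega)\}\, d\omega.$$

(4)  $I(Y|X^+,Y^-)$, the information about Y in the past and present of X
and the past of Y, is given by

    $$I(Y \mid X^+,Y^-) = (-\tfrac{1}{2}) \log \det \Sigma^{-1}(Y)\, \Sigma(Y \mid X^+,Y^-)$$

where $\Sigma(Y \mid X^+,Y^-) = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$
in terms of the partitioned submatrices appearing in
$\Sigma(X,Y \mid X^-,Y^-)$ computed in (2).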

(5)  $I(Y|X,Y^-)$, the information about Y in all of X and the past of Y,
is computed in an ingenious manner developed by econometricians in their study
of feedback measures [see Geweke (1982)]. First

    $$I(Y \mid X,Y^-) = I(Y \mid Y^-) + I(Y \mid Y^-;\ X,Y^-).$$

Next

    $$I(Y \mid Y^-;\ X,Y^-) = I(Y \mid Y^-;\ X^+,Y^-) + I(Y \mid X^+,Y^-;\ X,Y^-).$$

The first conditional information on the right hand side is computed

    $$I(Y \mid Y^-;\ X^+,Y^-) = I(Y \mid X^+,Y^-) - I(Y \mid Y^-)$$

in terms of the information determined in (4) and (1) respectively. The
second conditional information, defined by

    $$I(Y \mid X^+,Y^-;\ X,Y^-) = I(Y \mid X,Y^-) - I(Y \mid X^+,Y^-),$$

is computed by

    (*****)    $$I(Y \mid X^+,Y^-;\ X,Y^-) = I(X \mid X^-;\ X^-,Y^-) = I(X \mid X^-,Y^-) - I(X \mid X^-)$$

in terms of information computed in (2) and (1) respectively. A proof of
equation (*****) is based on the identity

    $$\log \det \Sigma(X,Y \mid X^-,Y^-)
      = \log \det \Sigma(Y \mid X^+,Y^-) + \log \det \Sigma(X \mid X^-,Y^-)
      = \log \det \Sigma(Y \mid X,Y^-) + \log \det \Sigma(X \mid X^-)$$

which follows from (I) and (II) in stage (2). Therefore

    $$\log \det \Sigma(Y \mid X^+,Y^-) - \log \det \Sigma(Y \mid X,Y^-)
      = \log \det \Sigma(X \mid X^-) - \log \det \Sigma(X \mid X^-,Y^-).$$
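
The computation in stage (5) can be traced for scalar X and Y. The following is a minimal sketch, assuming both series are standardized to unit variance so that each information number is $-\tfrac{1}{2}\log$ of a prediction error variance. It uses the Schur complement for $\Sigma(Y \mid X^+,Y^-)$ and identity (II) for $\Sigma(Y \mid X,Y^-)$. The function names and the input variances are hypothetical values chosen for illustration.

```python
import math

def info(normalized_variance):
    """Gaussian information -1/2 log(sigma^2) for a normalized prediction error variance."""
    return -0.5 * math.log(normalized_variance)

def feedback_decomposition(s_xx, s_xy, s_yy, s_x_past, s_y_past):
    """s_xx, s_xy, s_yy: entries of Sigma(X,Y|X-,Y-); s_x_past = Sigma(X|X-);
    s_y_past = Sigma(Y|Y-). Scalar, standardized X and Y are assumed."""
    s_y_plus = s_yy - s_xy * s_xy / s_xx          # Sigma(Y|X+,Y-), Schur complement
    det_joint = s_xx * s_yy - s_xy * s_xy         # det Sigma(X,Y|X-,Y-)
    s_y_all = det_joint / s_x_past                # Sigma(Y|X,Y-), from identity (II)
    return {
        "I(Y|Y-)": info(s_y_past),
        "I(Y|Y-;X-,Y-)": info(s_yy) - info(s_y_past),
        "I(Y|X-,Y-;X+,Y-)": info(s_y_plus) - info(s_yy),
        "I(Y|X+,Y-;X,Y-)": info(s_y_all) - info(s_y_plus),
        "I(X|X-;X-,Y-)": info(s_xx) - info(s_x_past),
        "I(Y|Y-;X,Y-)": info(s_y_all) - info(s_y_past),
    }

# Hypothetical prediction error variances, for illustration only
measures = feedback_decomposition(s_xx=0.30, s_xy=0.05, s_yy=0.20,
                                  s_x_past=0.40, s_y_past=0.50)
for name, value in measures.items():
    print(f"{name:20s} {value:.4f}")
```

The printed values show that $I(Y \mid X^+,Y^-; X,Y^-)$ coincides with $I(X \mid X^-; X^-,Y^-)$, as equation (*****) asserts, and that the three increments add up to $I(Y \mid Y^-; X,Y^-)$.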


Summary. A method of summarizing the various information numbers is
provided by reporting each of the terms in the following information
decomposition:

    $$I(Y \mid Y^-;\ X,Y^-) = I(Y \mid X,Y^-) - I(Y \mid Y^-)
      = I(Y \mid Y^-;\ X^-,Y^-) + I(Y \mid X^-,Y^-;\ X^+,Y^-) + I(Y \mid X^+,Y^-;\ X,Y^-)$$

which enables one to construct the information numbers in (1), (2), (4),
and (5). One also reports $I(Y|X)$ and $I(Y \mid X;\ X,Y^-)$.

The difference between measures of information is illuminated by
expressing them when possible in spectral terms:

    $$I(Y \mid Y^-;\ X,Y^-) = \int_0^1 (-\tfrac{1}{2}) \log \det \{I - C(\omega)\}\, d\omega,$$

    $$I(Y \mid X;\ X,Y^-) = \tfrac{1}{2} \log \det \int_0^1 f_{YY}(\omega)\, \{I - C(\omega)\}\, d\omega
      - \int_0^1 \tfrac{1}{2} \log \det f_{YY}(\omega)\, \{I - C(\omega)\}\, d\omega.$$

Causality and Feedback. It should be noted that notions of feedback
and causality studied by econometricians [see Geweke (1982)] can be easily
defined in terms of information numbers:

measure of linear dependence is $I(Y \mid Y^-;\ X,Y^-)$;

measure of linear feedback from X to Y is $I(Y \mid Y^-;\ X^-,Y^-)$;

measure of instantaneous linear feedback is $I(Y \mid X^-,Y^-;\ X^+,Y^-)$.

To summarize the relations between two multiple time series X(·) and
Y(·) one estimates

I.  Memory Measures
$I(X \mid X^-)$, $I(Y \mid Y^-)$

II.  Feedback Measures
$I(X \mid X^-;\ X^-,Y^-)$, $I(Y \mid Y^-;\ X^-,Y^-)$, $I(Y \mid X^-,Y^-;\ X^+,Y^-)$

III.  Information Increment Measures
$I(Y \mid Y^-;\ X^-,Y^-)$

$I(Y \mid Y^-;\ X^+,Y^-)$

$I(Y \mid Y^-;\ X,Y^-)$

$I(Y \mid X;\ X,Y^-)$

As an example, let us consider univariate time series X and Y which
are given as Series J by Box and Jenkins (1970); X is gas furnace data, and
Y is CO2 in output gas. The time series sample size is T = 296. The means
and standard deviations are given by

                              X         Y
    Mean                    -.057     53.51
    Standard deviation      1.07      3.20

The ratio of standard deviations of Y to X is about 3; it can be regarded
as a gain factor by which a change in X is multiplied into a change in Y.

The multiple covariances R(v) of the standardized time series (Y,X)
are computed for v = 0, 1, ..., 24; we list R(0), R(1), R(2), R(3), R(4), R(5).

The order determined AR schemes are: for X: order 6, $\Sigma(X \mid X^-)$ = .0302;
for Y: order 4, $\Sigma(Y \mid Y^-)$ = .0183.

The order determined joint AR scheme for the standardized time series
(Y,X) has order 4 and $\Sigma_{YY}$ = .0095, $\Sigma_{XX}$ = .0306,
$\Sigma_{XY}$ = -.0021.

Then $\Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$ = .0093.

The spectral regression of standardized Y on all of standardized X has
$\Sigma(Y|X)$ = .0618.

The memory measures are (formulas apply to standardized X and Y)

    $I(X \mid X^-) = -.5 \log \Sigma(X \mid X^-) = 1.75$,

    $I(Y \mid Y^-) = -.5 \log \Sigma(Y \mid Y^-) = 2.00$;

one concludes that each time series has long memory.

The feedback measures are

    $I(Y \mid Y^-;\ X^-,Y^-)$ = .330

    $I(Y \mid X^-,Y^-;\ X^+,Y^-)$ = .008, not significantly different from zero,

    $I(X \mid X^-;\ X^-,Y^-)$ = -.008, not significantly different from zero.

The information increment measures are

    $I(Y \mid Y^-;\ X^-,Y^-)$ = .33

    $I(Y \mid Y^-;\ X^+,Y^-)$ = .33

    $I(Y \mid Y^-;\ X,Y^-)$ = .33

    $I(Y \mid X;\ X,Y^-)$ = .94


One interprets these measures to mean that adding $Y^-$ to X adds much more
information than adding X to $Y^-$. Further, adding $X^-$ to $Y^-$ is as
informative as adding all X to $Y^-$.
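
The reported information numbers can be recomputed from the prediction error variances quoted above. The sketch below is an illustrative check, not the TIMESBOARD computation: it uses only the rounded values printed in this section and the scalar forms of the formulas in stages (1) to (5), so agreement with the reported measures is approximate. Variable names are hypothetical.

```python
import math

def info(normalized_variance):
    """Gaussian information -1/2 log(sigma^2) for a standardized series."""
    return -0.5 * math.log(normalized_variance)

# Rounded values quoted in the text (standardized X and Y)
s_x_past, s_y_past = 0.0302, 0.0183            # Sigma(X|X-), Sigma(Y|Y-)
s_yy, s_xx, s_xy = 0.0095, 0.0306, -0.0021     # entries of Sigma(X,Y|X-,Y-)
s_y_given_x = 0.0618                           # Sigma(Y|X), spectral regression

memory_x = info(s_x_past)                      # I(X|X-)
memory_y = info(s_y_past)                      # I(Y|Y-)

s_y_plus = s_yy - s_xy ** 2 / s_xx             # Sigma(Y|X+,Y-), Schur complement
s_y_all = (s_xx * s_yy - s_xy ** 2) / s_x_past # Sigma(Y|X,Y-), from identity (II)

feedback_x_to_y = info(s_yy) - info(s_y_past)  # I(Y|Y-;X-,Y-)
instantaneous = info(s_y_plus) - info(s_yy)    # I(Y|X-,Y-;X+,Y-)
feedback_y_to_x = info(s_xx) - info(s_x_past)  # I(X|X-;X-,Y-)
dependence = info(s_y_all) - info(s_y_past)    # I(Y|Y-;X,Y-)
past_of_y_given_x = info(s_y_all) - info(s_y_given_x)   # I(Y|X;X,Y-)

for name, value in [("I(X|X-)", memory_x), ("I(Y|Y-)", memory_y),
                    ("I(Y|Y-;X-,Y-)", feedback_x_to_y),
                    ("I(Y|X-,Y-;X+,Y-)", instantaneous),
                    ("I(X|X-;X-,Y-)", feedback_y_to_x),
                    ("I(Y|Y-;X,Y-)", dependence),
                    ("I(Y|X;X,Y-)", past_of_y_given_x)]:
    print(f"{name:20s} {value:7.3f}")
```

The slightly negative estimate of $I(X \mid X^-; X^-,Y^-)$ arises because the univariate and joint autoregressive schemes were fitted with different orders.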